Machine Learning: Personal Loan Campaign

Problem Statement

Context

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.

You as a Data Scientist at AllLife Bank have to build a model that will help the marketing department to identify the potential customers who have a higher probability of purchasing the loan.

Objective

To predict whether a liability customer will buy personal loans, to understand which customer attributes are most significant in driving purchases, and to identify which segment of customers to target more.

Data Dictionary

ID: Customer ID

Age: Customer’s age in completed years

Experience: # years of professional experience

Income: Annual income of the customer (in thousand dollars)

ZIP Code: Home Address ZIP code.

Family: The family size of the customer

CCAvg: Average spending on credit cards per month (in thousand dollars)

Education: Education Level. 1: Undergrad; 2: Graduate;3: Advanced/Professional

Mortgage: Value of house mortgage if any. (in thousand dollars)

Personal_Loan: Did this customer accept the personal loan offered in the last campaign?

Securities_Account: Does the customer have a securities account with the bank?

CD_Account: Does the customer have a certificate of deposit (CD) account with the bank?

Online: Do customers use Internet banking facilities?

CreditCard: Does the customer use a credit card issued by any other Bank (excluding All Life Bank)?

Let us start by importing the required libraries

In [371]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# to split data into training and test sets
from sklearn.model_selection import train_test_split

# to build decision tree model
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree


# to compute classification metrics
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    recall_score,
    precision_score,
    f1_score,
)

import warnings
warnings.filterwarnings("ignore")

Analysing Data

In [372]:
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [373]:
Loan_Model_df = pd.read_csv("/content/drive/MyDrive/AI-ML/M02/Loan_Modelling.csv")
In [374]:
Loan_Model_df.head(5)
Out[374]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
In [375]:
Loan_Model_df.sample(5)
Out[375]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
1577 1578 34 8 65 92093 1 3.0 1 227 1 0 0 1 0
2984 2985 54 28 94 92709 2 1.1 1 188 0 0 0 0 0
4075 4076 30 4 40 90601 4 0.8 1 0 0 0 0 1 0
3284 3285 25 -1 101 95819 4 2.1 3 0 0 0 0 0 1
2665 2666 35 9 105 90064 2 4.5 3 0 0 0 0 0 0
In [376]:
Loan_Model_df.shape

print("Shape of the Data Set (rows X columns) : ",Loan_Model_df.shape)
print("Total Number of Data (Quantity Of Data) : ",Loan_Model_df.shape[0])
print("Total Number of Faetures (Number Of Columns) : ",Loan_Model_df.shape[1])
Shape of the Data Set (rows X columns) :  (5000, 14)
Total Number of Data (Quantity Of Data) :  5000
Total Number of Faetures (Number Of Columns) :  14
In [377]:
Loan_Model_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIPCode             5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal_Loan       5000 non-null   int64  
 10  Securities_Account  5000 non-null   int64  
 11  CD_Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
In [378]:
Loan_Model_df.describe(include="all").T
Out[378]:
count mean std min 25% 50% 75% max
ID 5000.0 2500.500000 1443.520003 1.0 1250.75 2500.5 3750.25 5000.0
Age 5000.0 45.338400 11.463166 23.0 35.00 45.0 55.00 67.0
Experience 5000.0 20.104600 11.467954 -3.0 10.00 20.0 30.00 43.0
Income 5000.0 73.774200 46.033729 8.0 39.00 64.0 98.00 224.0
ZIPCode 5000.0 93169.257000 1759.455086 90005.0 91911.00 93437.0 94608.00 96651.0
Family 5000.0 2.396400 1.147663 1.0 1.00 2.0 3.00 4.0
CCAvg 5000.0 1.937938 1.747659 0.0 0.70 1.5 2.50 10.0
Education 5000.0 1.881000 0.839869 1.0 1.00 2.0 3.00 3.0
Mortgage 5000.0 56.498800 101.713802 0.0 0.00 0.0 101.00 635.0
Personal_Loan 5000.0 0.096000 0.294621 0.0 0.00 0.0 0.00 1.0
Securities_Account 5000.0 0.104400 0.305809 0.0 0.00 0.0 0.00 1.0
CD_Account 5000.0 0.060400 0.238250 0.0 0.00 0.0 0.00 1.0
Online 5000.0 0.596800 0.490589 0.0 0.00 1.0 1.00 1.0
CreditCard 5000.0 0.294000 0.455637 0.0 0.00 0.0 1.00 1.0
In [379]:
Loan_Model_df.isnull().sum()
Out[379]:
0
ID 0
Age 0
Experience 0
Income 0
ZIPCode 0
Family 0
CCAvg 0
Education 0
Mortgage 0
Personal_Loan 0
Securities_Account 0
CD_Account 0
Online 0
CreditCard 0

In [380]:
Working_Loan_Model_df = Loan_Model_df.copy()

Working_Loan_Model_df
Working_Loan_Model_df.describe(include="all").T
Out[380]:
count mean std min 25% 50% 75% max
ID 5000.0 2500.500000 1443.520003 1.0 1250.75 2500.5 3750.25 5000.0
Age 5000.0 45.338400 11.463166 23.0 35.00 45.0 55.00 67.0
Experience 5000.0 20.104600 11.467954 -3.0 10.00 20.0 30.00 43.0
Income 5000.0 73.774200 46.033729 8.0 39.00 64.0 98.00 224.0
ZIPCode 5000.0 93169.257000 1759.455086 90005.0 91911.00 93437.0 94608.00 96651.0
Family 5000.0 2.396400 1.147663 1.0 1.00 2.0 3.00 4.0
CCAvg 5000.0 1.937938 1.747659 0.0 0.70 1.5 2.50 10.0
Education 5000.0 1.881000 0.839869 1.0 1.00 2.0 3.00 3.0
Mortgage 5000.0 56.498800 101.713802 0.0 0.00 0.0 101.00 635.0
Personal_Loan 5000.0 0.096000 0.294621 0.0 0.00 0.0 0.00 1.0
Securities_Account 5000.0 0.104400 0.305809 0.0 0.00 0.0 0.00 1.0
CD_Account 5000.0 0.060400 0.238250 0.0 0.00 0.0 0.00 1.0
Online 5000.0 0.596800 0.490589 0.0 0.00 1.0 1.00 1.0
CreditCard 5000.0 0.294000 0.455637 0.0 0.00 0.0 1.00 1.0
In [381]:
Loan_Model_features = Working_Loan_Model_df.select_dtypes(include=["int64","float64"])
Loan_Model_features
Out[381]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4995 4996 29 3 40 92697 1 1.9 3 0 0 0 0 1 0
4996 4997 30 4 15 92037 4 0.4 1 85 0 0 0 1 0
4997 4998 63 39 24 93023 2 0.3 3 0 0 0 0 0 0
4998 4999 65 40 49 90034 3 0.5 2 0 0 0 0 1 0
4999 5000 28 4 83 92612 3 0.8 1 0 0 0 0 1 1

5000 rows × 14 columns

ID: Unique customer identifier, no analytical significance.

Age: Customers are mostly middle-aged, averaging around 45 years.

Experience: Closely follows age, averaging 20 years; negative values indicate data errors.

Income: Average annual income is about $74K, with wide variation among customers.

ZIPCode: Represents customer location, not directly useful for modeling.

Family: Most customers have small families of 2–3 members.

CCAvg: Average monthly credit card spending is $1.9K, indicating varied spending behavior.

Education: Majority are graduates or professionals; higher education links to loan acceptance.

Mortgage: About half of customers have no mortgage; others vary widely up to $635K.

Personal_Loan: Only ~9.6% of customers accepted a personal loan, showing class imbalance.

Securities_Account: Around 10% hold a securities account with the bank.

CD_Account: Only 6% have a CD account; these customers show higher loan interest.

Online: Nearly 60% use online banking — potential for digital marketing.

CreditCard: 29% have a credit card with another bank, useful for cross-selling.

Univarent Analysis

In [382]:
plt.figure(figsize=(15, 10))

features_for_univarent_analysis = Working_Loan_Model_df.columns.tolist()
print(features_for_univarent_analysis)


for i , feature in enumerate(features_for_univarent_analysis):
    plt.subplot(5, 3, i+1)
    sns.histplot(data=Working_Loan_Model_df, x=feature)

    plt.tight_layout();
['ID', 'Age', 'Experience', 'Income', 'ZIPCode', 'Family', 'CCAvg', 'Education', 'Mortgage', 'Personal_Loan', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard']
In [383]:
plt.figure(figsize=(15, 10))

features_for_univarent_analysis = Working_Loan_Model_df.columns.tolist()
print(features_for_univarent_analysis)


for i , feature in enumerate(features_for_univarent_analysis):
    plt.subplot(5, 3, i+1)
    sns.boxplot(data=Working_Loan_Model_df, x=feature)

    plt.tight_layout();
['ID', 'Age', 'Experience', 'Income', 'ZIPCode', 'Family', 'CCAvg', 'Education', 'Mortgage', 'Personal_Loan', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard']
In [384]:
print(100*Working_Loan_Model_df['Personal_Loan'].value_counts(normalize=True), '\n')

# plotting the count plot for Personal_Laon
sns.countplot(data=Working_Loan_Model_df, x='Personal_Loan');
Personal_Loan
0    90.4
1     9.6
Name: proportion, dtype: float64 

The dataset shows that 90.4% of customers did not take a personal loan, while only 9.6% opted.

In [385]:
print(100*Working_Loan_Model_df['Online'].value_counts(normalize=True), '\n')

# plotting the count plot for Online
sns.countplot(data=Working_Loan_Model_df, x='Online');
Online
1    59.68
0    40.32
Name: proportion, dtype: float64 

About 59.7% of customers use internet banking, while 40.3% do not.

In [386]:
print(100*Working_Loan_Model_df['CreditCard'].value_counts(normalize=True), '\n')

# plotting the count plot for CreditCard
sns.countplot(data=Working_Loan_Model_df, x='CreditCard');
CreditCard
0    70.6
1    29.4
Name: proportion, dtype: float64 

Approximately 29.4% of customers have a credit card from another bank, while 70.6% do not.

BiVarent Anaylsis

In [387]:
# Scatter plot matrix
plt.figure(figsize=(16, 12))
sns.pairplot(Working_Loan_Model_df, vars=features_for_bivarent_analysis, hue='Personal_Loan', diag_kind='kde');
<Figure size 1600x1200 with 0 Axes>
In [388]:
# defining the size of the plot
plt.figure(figsize=(12, 7))

features_for_bivarent_analysis = ['Age', 'Experience', 'Income', 'Family', 'CCAvg', 'Education', 'Mortgage', 'Personal_Loan', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard']
# plotting the heatmap for correlation
sns.heatmap(
    Working_Loan_Model_df[features_for_bivarent_analysis].corr(),annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="YlGnBu"
);

Strongest correlations:

Income : 0.50

CCAvg (average credit card spending) : 0.37

CD_Account : 0.32

These are positively correlated with Personal_Loan, meaning customers with higher income, higher average credit card spending, or investment accounts are more likely to take a personal loan.

Weak correlations:

Age, Experience, Family, Education, Mortgage : correlations close to 0 These features have very little influence on whether a customer takes a personal loan.

features (Online, CreditCard, Securities_Account) show very low correlation ( 0.02 - 0.00 ) with Personal_Loan, so they are not strong predictors individually.

Overall Insight:

Financial strength indicators (Income, CCAvg, CD_ccounts) are the key drivers for Personal_Loan uptake.

Demographics and usage of banking services have minimal impact on predicting Personal_Loan.

In [389]:
# Income vs Approved (boxplot)
plt.figure(figsize=(10, 6))
sns.boxplot(data=Working_Loan_Model_df, x='Personal_Loan', y='Income')
plt.title('Income vs Personal_Loan (Boxplot)')
plt.show()

Income is expected to be a key driver, as customers with higher annual incomes generally possess greater financial capacity, making them more likely to accept a personal loan offer.

In [390]:
# Income vs Approved (boxplot)
plt.figure(figsize=(10, 6))
sns.boxplot(data=Working_Loan_Model_df, x='Personal_Loan', y='Family')
plt.title('Family vs Personal_Loan (Boxplot)')
plt.show()

While very large or small family sizes might influence financial stability, customers with a median family size of 3 are often in a phase of life that increases financial commitments, making them more receptive to personal loan offers.

In [391]:
# Income vs Approved (boxplot)
plt.figure(figsize=(10, 6))
sns.boxplot(data=Working_Loan_Model_df, x='Personal_Loan', y='CCAvg')
plt.title('CCAvg vs Personal_Loan (Boxplot)')
plt.show()

Customers with a higher average monthly credit card spending (CCAvg) may be more likely to purchase a personal loan, as this spending indicates a greater financial need or a comfort level with leveraging credit that makes a loan appealing.

In [392]:
# Income vs Approved (boxplot)
plt.figure(figsize=(10, 6))
sns.boxplot(data=Working_Loan_Model_df, x='Personal_Loan', y='Education')
plt.title('Education vs Personal_Loan (Boxplot)')
plt.show()

Higher education correlates positively with loan acceptance.

Graduate or Professional education levels correlate with higher acceptance.

In [393]:
# Income vs Approved (boxplot)
plt.figure(figsize=(10, 6))
sns.boxplot(data=Working_Loan_Model_df, x='Personal_Loan', y='CD_Account')
plt.title('CD_Account vs Personal_Loan (Boxplot)')
plt.show()

Despite the general expectation that CD_Account holders prioritize savings, the moderate positive correlation (0.32) observed in the data suggests the opposite. Customers with a Certificate of Deposit (CD_Account) are more likely to accept a personal loan.

In [394]:
# Income vs Approved (boxplot)
plt.figure(figsize=(10, 6))
sns.boxplot(data=Working_Loan_Model_df, x='Personal_Loan', y='CreditCard')
plt.title('CreditCard vs Personal_Loan (Boxplot)')
plt.show()

he feature CreditCard (use of a credit card from another bank) is shown to have no impact on the likelihood of accepting a personal loan, as indicated by a correlation of 0.0. This suggests that owning a non-AllLife Bank credit card is neither a predictor of financial need nor financial aversion for this specific loan product.

In [395]:
Working_Loan_Model_df.loc[Working_Loan_Model_df["Experience"]<0, "Experience"] = np.nan
Working_Loan_Model_df["Experience"].fillna(Working_Loan_Model_df["Experience"].median(), inplace=True)

for c in ["ID","ZIPCode"]:
    if c in Working_Loan_Model_df.columns: Working_Loan_Model_df.drop(c, axis=1, inplace=True)

During feature cleaning, negative values in the Experience column were treated as missing (NaN) and subsequently imputed using the median of the column, while the ID and ZIP Code columns were dropped as they are unique identifiers with no predictive power.

In [396]:
# defining the explanatory (independent) and response (dependent) variables
X = Working_Loan_Model_df.drop(["Personal_Loan"], axis=1)
y = Working_Loan_Model_df["Personal_Loan"]
In [397]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state=42)
In [398]:
print("Shape of training set:", X_train.shape)
print("Shape of test set:", X_test.shape, '\n')
print("Percentage of classes in training set:")
print(100*y_train.value_counts(normalize=True), '\n')
print("Percentage of classes in test set:")
print(100*y_test.value_counts(normalize=True))
Shape of training set: (4000, 11)
Shape of test set: (1000, 11) 

Percentage of classes in training set:
Personal_Loan
0    90.4
1     9.6
Name: proportion, dtype: float64 

Percentage of classes in test set:
Personal_Loan
0    90.4
1     9.6
Name: proportion, dtype: float64

After splitting the dataset into training and test sets, the resulting sample sizes were 4,000 and 1,000 rows, respectively, with both sets demonstrating excellent stratification by maintaining the original class distribution of approximately 90.4% non-loan customers (0) and 9.6% personal loan customers (1).

In [399]:
dtree1 = DecisionTreeClassifier(random_state=42)    # random_state sets a seed value and enables reproducibility

# fitting the model to the training data
dtree1.fit(X_train, y_train)
Out[399]:
DecisionTreeClassifier(random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Model Evaluation

We define a utility function to collate all the metrics into a single data frame, and another to plot the confusion matrix.

In [400]:
def model_performance_classification(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)
    recall = recall_score(target, pred)
    precision = precision_score(target, pred)
    f1 = f1_score(target, pred)

    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [401]:
def plot_confusion_matrix(model, predictors, target):
    """
    To plot the confusion_matrix with percentages
    """

    y_pred = model.predict(predictors)

    cm = confusion_matrix(target, y_pred)

    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
In [402]:
plot_confusion_matrix(dtree1, X_train, y_train)
In [403]:
dtree1_train_perf = model_performance_classification(
    dtree1, X_train, y_train
)
dtree1_train_perf
Out[403]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
In [404]:
plot_confusion_matrix(dtree1, X_test, y_test)
In [405]:
dtree1_test_perf = model_performance_classification(
    dtree1, X_test, y_test
)
dtree1_test_perf
Out[405]:
Accuracy Recall Precision F1
0 0.982 0.947917 0.875 0.91

The initial Decision Tree model achieved perfect metrics (Accuracy, Recall, Precision, F1-score of 1.0) on the training data, but the noticeable drop in performance on the test set (e.g., Precision of 0.875 and F1-score of 0.91) indicates significant overfitting and necessitates model pruning for better generalization.

In [406]:
feature_names = list(X_train.columns)
plt.figure(figsize=(20, 20))

out = tree.plot_tree(
    dtree1,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)

for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")    # set arrow color to black
        arrow.set_linewidth(1)          # set arrow linewidth to 1

# displaying the plot
plt.show()
  • We can observe that this is a very complex tree.
In [407]:
print(
    tree.export_text(
        dtree1,    # specify the model
        feature_names=feature_names,    # specify the feature names
        show_weights=True    # specify whether or not to show the weights associated with the model
    )
)
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [2845.00, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- Income <= 92.50
|   |   |   |   |--- Age <= 27.00
|   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |--- Age >  27.00
|   |   |   |   |   |--- CCAvg <= 3.65
|   |   |   |   |   |   |--- Mortgage <= 216.50
|   |   |   |   |   |   |   |--- Income <= 82.50
|   |   |   |   |   |   |   |   |--- Age <= 36.50
|   |   |   |   |   |   |   |   |   |--- Education <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Education >  1.50
|   |   |   |   |   |   |   |   |   |   |--- Income <= 63.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- Income >  63.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |   |--- Age >  36.50
|   |   |   |   |   |   |   |   |   |--- CCAvg <= 3.35
|   |   |   |   |   |   |   |   |   |   |--- weights: [29.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- CCAvg >  3.35
|   |   |   |   |   |   |   |   |   |   |--- CCAvg <= 3.55
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- CCAvg >  3.55
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [9.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Income >  82.50
|   |   |   |   |   |   |   |   |--- CCAvg <= 3.05
|   |   |   |   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- CCAvg >  3.05
|   |   |   |   |   |   |   |   |   |--- Education <= 2.50
|   |   |   |   |   |   |   |   |   |   |--- Family <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- Family >  1.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Education >  2.50
|   |   |   |   |   |   |   |   |   |   |--- Mortgage <= 94.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- Mortgage >  94.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- Mortgage >  216.50
|   |   |   |   |   |   |   |--- Mortgage <= 249.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |--- Mortgage >  249.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  3.65
|   |   |   |   |   |   |--- Mortgage <= 93.50
|   |   |   |   |   |   |   |--- weights: [47.00, 0.00] class: 0
|   |   |   |   |   |   |--- Mortgage >  93.50
|   |   |   |   |   |   |   |--- Mortgage <= 99.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Mortgage >  99.50
|   |   |   |   |   |   |   |   |--- weights: [19.00, 0.00] class: 0
|   |   |   |--- Income >  92.50
|   |   |   |   |--- Education <= 1.50
|   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |--- Education >  1.50
|   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |--- CCAvg <= 3.95
|   |   |   |   |   |   |   |--- CCAvg <= 3.05
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- CCAvg >  3.05
|   |   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |   |--- CCAvg >  3.95
|   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |--- Family >  2.50
|   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |--- CD_Account >  0.50
|   |   |   |--- CCAvg <= 4.25
|   |   |   |   |--- weights: [0.00, 9.00] class: 1
|   |   |   |--- CCAvg >  4.25
|   |   |   |   |--- Mortgage <= 38.00
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Mortgage >  38.00
|   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|--- Income >  98.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- Income <= 99.50
|   |   |   |   |--- Family <= 1.50
|   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |--- Family >  1.50
|   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |--- Income >  99.50
|   |   |   |   |--- Income <= 104.50
|   |   |   |   |   |--- CCAvg <= 3.31
|   |   |   |   |   |   |--- weights: [20.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  3.31
|   |   |   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |   |   |--- Age <= 33.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Age >  33.00
|   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |--- CD_Account >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |--- Income >  104.50
|   |   |   |   |   |--- weights: [506.00, 0.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- Income <= 113.50
|   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |--- CCAvg <= 1.45
|   |   |   |   |   |   |--- CCAvg <= 0.65
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- CCAvg >  0.65
|   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  1.45
|   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |   |--- Online >  0.50
|   |   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |   |--- weights: [10.00, 0.00] class: 0
|   |   |   |   |   |--- CD_Account >  0.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |--- Income >  113.50
|   |   |   |   |--- weights: [0.00, 54.00] class: 1
|   |--- Education >  1.50
|   |   |--- Income <= 114.50
|   |   |   |--- CCAvg <= 2.95
|   |   |   |   |--- Income <= 106.50
|   |   |   |   |   |--- weights: [36.00, 0.00] class: 0
|   |   |   |   |--- Income >  106.50
|   |   |   |   |   |--- Experience <= 31.50
|   |   |   |   |   |   |--- Age <= 33.50
|   |   |   |   |   |   |   |--- Experience <= 4.50
|   |   |   |   |   |   |   |   |--- Experience <= 3.50
|   |   |   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Experience >  3.50
|   |   |   |   |   |   |   |   |   |--- Age <= 29.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Age >  29.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Experience >  4.50
|   |   |   |   |   |   |   |   |--- weights: [10.00, 0.00] class: 0
|   |   |   |   |   |   |--- Age >  33.50
|   |   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |   |--- Age <= 36.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |   |   |   |--- Age >  36.50
|   |   |   |   |   |   |   |   |   |--- CCAvg <= 1.05
|   |   |   |   |   |   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- CCAvg >  1.05
|   |   |   |   |   |   |   |   |   |   |--- Family <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- Family >  1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |--- Experience >  31.50
|   |   |   |   |   |   |--- weights: [11.00, 0.00] class: 0
|   |   |   |--- CCAvg >  2.95
|   |   |   |   |--- CCAvg <= 4.65
|   |   |   |   |   |--- Mortgage <= 265.50
|   |   |   |   |   |   |--- Experience <= 35.50
|   |   |   |   |   |   |   |--- Mortgage <= 172.00
|   |   |   |   |   |   |   |   |--- CCAvg <= 3.70
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |   |   |   |   |   |--- CCAvg >  3.70
|   |   |   |   |   |   |   |   |   |--- Education <= 2.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Education >  2.50
|   |   |   |   |   |   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- Family >  2.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |   |   |--- Mortgage >  172.00
|   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |--- Experience >  35.50
|   |   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |   |   |--- Mortgage >  265.50
|   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |--- CCAvg >  4.65
|   |   |   |   |   |--- weights: [0.00, 7.00] class: 1
|   |   |--- Income >  114.50
|   |   |   |--- Income <= 116.50
|   |   |   |   |--- CCAvg <= 1.10
|   |   |   |   |   |--- CCAvg <= 0.65
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- CCAvg >  0.65
|   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |--- CCAvg >  1.10
|   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |--- Income >  116.50
|   |   |   |   |--- weights: [0.00, 244.00] class: 1

In [407]:

Decision Tree (Pre-pruning)

In [408]:
# define the parameters of the tree to iterate over
max_depth_values = np.arange(2, 11, 2)
max_leaf_nodes_values = np.arange(10, 51, 10)
min_samples_split_values = np.arange(10, 51, 10)

# initialize variables to store the best model and its performance
best_estimator = None
best_score_diff = float('inf')

# iterate over all combinations of the specified parameter values
for max_depth in max_depth_values:
    for max_leaf_nodes in max_leaf_nodes_values:
        for min_samples_split in min_samples_split_values:

            estimator = DecisionTreeClassifier(
                max_depth=max_depth,
                max_leaf_nodes=max_leaf_nodes,
                min_samples_split=min_samples_split,
                random_state=42
            )

            estimator.fit(X_train, y_train)

            y_train_pred = estimator.predict(X_train)
            y_test_pred = estimator.predict(X_test)

            train_f1_score = f1_score(y_train, y_train_pred)
            test_f1_score = f1_score(y_test, y_test_pred)

            score_diff = abs(train_f1_score - test_f1_score)

            if score_diff < best_score_diff:
                best_score_diff = score_diff
                best_estimator = estimator
In [409]:
# creating an instance of the best model
dtree2 = best_estimator

dtree2.fit(X_train, y_train)
Out[409]:
DecisionTreeClassifier(max_depth=np.int64(6), max_leaf_nodes=np.int64(30),
                       min_samples_split=np.int64(10), random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
In [410]:
plot_confusion_matrix(dtree2, X_train, y_train)
In [411]:
dtree2_train_perf = model_performance_classification(
    dtree2, X_train, y_train
)
dtree2_train_perf
Out[411]:
Accuracy Recall Precision F1
0 0.98775 0.914062 0.956403 0.934754
In [412]:
plot_confusion_matrix(dtree2, X_test, y_test)
In [413]:
dtree2_test_perf = model_performance_classification(
    dtree2, X_test, y_test
)
dtree2_test_perf
Out[413]:
Accuracy Recall Precision F1
0 0.986 0.958333 0.901961 0.929293

After pre-pruning the Decision Tree (dtree2), the performance gap between the training and test sets has significantly closed, resulting in a more generalized model that achieves a high Accuracy of 0.986 and a balanced F1-score of 0.929 on the unseen test data, indicating that the overfitting issue has been successfully addressed.

In [414]:
# list of feature names in X_train
feature_names = list(X_train.columns)

# set the figure size for the plot
plt.figure(figsize=(20, 20))
out = tree.plot_tree(
    dtree2,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)

for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)

# displaying the plot
plt.show()
In [415]:
# printing a text report showing the rules of a decision tree
print(
    tree.export_text(
        dtree2,
        feature_names=feature_names,
        show_weights=True
    )
)
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [2845.00, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- Income <= 92.50
|   |   |   |   |--- Age <= 27.00
|   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |--- Age >  27.00
|   |   |   |   |   |--- CCAvg <= 3.65
|   |   |   |   |   |   |--- weights: [61.00, 11.00] class: 0
|   |   |   |   |   |--- CCAvg >  3.65
|   |   |   |   |   |   |--- weights: [66.00, 1.00] class: 0
|   |   |   |--- Income >  92.50
|   |   |   |   |--- Education <= 1.50
|   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |--- Education >  1.50
|   |   |   |   |   |--- Education <= 2.50
|   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |   |   |--- Education >  2.50
|   |   |   |   |   |   |--- weights: [3.00, 3.00] class: 0
|   |   |--- CD_Account >  0.50
|   |   |   |--- CCAvg <= 4.25
|   |   |   |   |--- weights: [0.00, 9.00] class: 1
|   |   |   |--- CCAvg >  4.25
|   |   |   |   |--- weights: [3.00, 1.00] class: 0
|--- Income >  98.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- Income <= 99.50
|   |   |   |   |--- weights: [2.00, 2.00] class: 0
|   |   |   |--- Income >  99.50
|   |   |   |   |--- Income <= 104.50
|   |   |   |   |   |--- CCAvg <= 3.31
|   |   |   |   |   |   |--- weights: [20.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  3.31
|   |   |   |   |   |   |--- weights: [3.00, 3.00] class: 0
|   |   |   |   |--- Income >  104.50
|   |   |   |   |   |--- weights: [506.00, 0.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- Income <= 113.50
|   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |--- weights: [2.00, 5.00] class: 1
|   |   |   |   |--- Online >  0.50
|   |   |   |   |   |--- CCAvg <= 0.65
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- CCAvg >  0.65
|   |   |   |   |   |   |--- weights: [10.00, 0.00] class: 0
|   |   |   |--- Income >  113.50
|   |   |   |   |--- weights: [0.00, 54.00] class: 1
|   |--- Education >  1.50
|   |   |--- Income <= 114.50
|   |   |   |--- CCAvg <= 2.95
|   |   |   |   |--- Income <= 106.50
|   |   |   |   |   |--- weights: [36.00, 0.00] class: 0
|   |   |   |   |--- Income >  106.50
|   |   |   |   |   |--- Experience <= 31.50
|   |   |   |   |   |   |--- weights: [28.00, 12.00] class: 0
|   |   |   |   |   |--- Experience >  31.50
|   |   |   |   |   |   |--- weights: [11.00, 0.00] class: 0
|   |   |   |--- CCAvg >  2.95
|   |   |   |   |--- CCAvg <= 4.65
|   |   |   |   |   |--- Mortgage <= 265.50
|   |   |   |   |   |   |--- weights: [12.00, 13.00] class: 1
|   |   |   |   |   |--- Mortgage >  265.50
|   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |--- CCAvg >  4.65
|   |   |   |   |   |--- weights: [0.00, 7.00] class: 1
|   |   |--- Income >  114.50
|   |   |   |--- Income <= 116.50
|   |   |   |   |--- weights: [2.00, 7.00] class: 1
|   |   |   |--- Income >  116.50
|   |   |   |   |--- weights: [0.00, 244.00] class: 1

Decision Tree (Post-pruning)

In [416]:
# Create an instance of the decision tree model
clf = DecisionTreeClassifier(random_state=42)

path = clf.cost_complexity_pruning_path(X_train, y_train)

ccp_alphas = abs(path.ccp_alphas)

impurities = path.impurities
In [417]:
pd.DataFrame(path)
Out[417]:
ccp_alphas impurities
0 0.000000 0.000000
1 0.000244 0.000487
2 0.000246 0.000980
3 0.000296 0.001869
4 0.000306 0.002788
5 0.000331 0.003780
6 0.000333 0.004113
7 0.000333 0.004446
8 0.000333 0.004780
9 0.000350 0.005830
10 0.000373 0.007321
11 0.000375 0.007696
12 0.000381 0.008077
13 0.000400 0.008477
14 0.000417 0.008894
15 0.000419 0.011410
16 0.000455 0.011865
17 0.000493 0.012850
18 0.000542 0.013934
19 0.000550 0.016133
20 0.000579 0.020187
21 0.000584 0.020771
22 0.000779 0.021550
23 0.000823 0.022373
24 0.000831 0.023204
25 0.000870 0.024945
26 0.002424 0.027369
27 0.002667 0.030036
28 0.003000 0.033036
29 0.003753 0.036789
30 0.020023 0.056812
31 0.021549 0.078361
32 0.047604 0.173568
In [418]:
# Create a figure
fig, ax = plt.subplots(figsize=(10, 5))

ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")

ax.set_xlabel("Effective Alpha")
ax.set_ylabel("Total impurity of leaves")
ax.set_title("Total Impurity vs Effective Alpha for training set");
In [419]:
# Initialize an empty list to store the decision tree classifiers
clfs = []

for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(ccp_alpha=ccp_alpha, random_state=42)

    clf.fit(X_train, y_train)
    clfs.append(clf)

print(
    "Number of nodes in the last tree is {} with ccp_alpha {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is 1 with ccp_alpha 0.04760359071815694
In [420]:
# Remove the last classifier and corresponding ccp_alpha value from the lists
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]

depth = [clf.tree_.max_depth for clf in clfs]

fig, ax = plt.subplots(2, 1, figsize=(10, 7))

ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("Alpha")
ax[0].set_ylabel("Number of nodes")
ax[0].set_title("Number of nodes vs Alpha")

# Plot the depth of tree versus ccp_alphas on the second subplot
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("Alpha")
ax[1].set_ylabel("Depth of tree")
ax[1].set_title("Depth vs Alpha")

# Adjust the layout of the subplots to avoid overlap
fig.tight_layout()
In [421]:
train_f1_scores = []  # Initialize an empty list to store F1 scores for training set for each decision tree classifier

# Iterate through each decision tree classifier in 'clfs'
for clf in clfs:
    pred_train = clf.predict(X_train)
    f1_train = f1_score(y_train, pred_train)
    train_f1_scores.append(f1_train)
In [422]:
test_f1_scores = []

for clf in clfs:
    pred_test = clf.predict(X_test)
    f1_test = f1_score(y_test, pred_test)
    test_f1_scores.append(f1_test)
In [423]:
# Create a figure
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("Alpha")
ax.set_ylabel("F1 Score")
ax.set_title("F1 Score vs Alpha for training and test sets")

ax.plot(ccp_alphas, train_f1_scores, marker="o", label="training", drawstyle="steps-post")

ax.plot(ccp_alphas, test_f1_scores, marker="o", label="test", drawstyle="steps-post")

ax.legend();  # Add a legend to the plot
In [424]:
# creating the model where we get highest test F1 Score
index_best_model = np.argmax(test_f1_scores)

dtree3 = clfs[index_best_model]
print(dtree3)
DecisionTreeClassifier(ccp_alpha=np.float64(0.0008702884311333967),
                       random_state=42)
In [425]:
plot_confusion_matrix(dtree3, X_train, y_train)
In [426]:
dtree3_train_perf = model_performance_classification(
    dtree3, X_train, y_train
)
dtree3_train_perf
Out[426]:
Accuracy Recall Precision F1
0 0.98475 0.885417 0.952381 0.917679
In [427]:
plot_confusion_matrix(dtree3, X_test, y_test)
In [428]:
dtree3_test_perf = model_performance_classification(
    dtree3, X_test, y_test
)
dtree3_test_perf
Out[428]:
Accuracy Recall Precision F1
0 0.991 0.958333 0.948454 0.953368

The post-pruned Decision Tree (dtree3) is the final chosen model. It is the best-performing and most generalized model from the tuning process, achieving an Accuracy of 0.991, a Recall of 0.958, and a Precision of 0.948 on the test set, culminating in the highest F1-score of 0.953. This performance ensures that the marketing department can target customers with high confidence, minimizing the cost of reaching out to non-buyers while successfully capturing nearly 96% of all potential loan takers.

In [429]:
# list of feature names in X_train
feature_names = list(X_train.columns)

# set the figure size for the plot
plt.figure(figsize=(10, 7))

# plotting the decision tree
out = tree.plot_tree(
    dtree3,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)

# add arrows to the decision tree splits if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")    # set arrow color to black
        arrow.set_linewidth(1)          # set arrow linewidth to 1

# displaying the plot
plt.show()
In [430]:
print(
    tree.export_text(
        dtree3,
        feature_names=feature_names,
        show_weights=True
    )
)
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [2845.00, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- weights: [136.00, 21.00] class: 0
|   |   |--- CD_Account >  0.50
|   |   |   |--- weights: [3.00, 10.00] class: 1
|--- Income >  98.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [531.00, 5.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- Income <= 113.50
|   |   |   |   |--- weights: [12.00, 6.00] class: 0
|   |   |   |--- Income >  113.50
|   |   |   |   |--- weights: [0.00, 54.00] class: 1
|   |--- Education >  1.50
|   |   |--- Income <= 114.50
|   |   |   |--- CCAvg <= 2.95
|   |   |   |   |--- weights: [75.00, 12.00] class: 0
|   |   |   |--- CCAvg >  2.95
|   |   |   |   |--- weights: [12.00, 25.00] class: 1
|   |   |--- Income >  114.50
|   |   |   |--- weights: [2.00, 251.00] class: 1

In [430]:

Model Performance Comparison and Final Model Selection

In [431]:
models_train_comp_df = pd.concat(
    [
        dtree1_train_perf.T,
        dtree2_train_perf.T,
        dtree3_train_perf.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree (sklearn default)",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[431]:
Decision Tree (sklearn default) Decision Tree (Pre-Pruning) Decision Tree (Post-Pruning)
Accuracy 1.0 0.987750 0.984750
Recall 1.0 0.914062 0.885417
Precision 1.0 0.956403 0.952381
F1 1.0 0.934754 0.917679

The performance comparison on the training set clearly illustrates the process of regularization: the initial default Decision Tree with perfect metrics (1.0) was heavily overfit; pre-pruning intentionally reduced training performance (F1-score 0.935) to build a more robust, generalized model, while post-pruning further simplified the tree, resulting in a slightly lower training F1-score (0.918), but ultimately yielding the best-performing, most generalized model on the unseen test data.

In [432]:
models_test_comp_df = pd.concat(
    [
        dtree1_test_perf.T,
        dtree2_test_perf.T,
        dtree3_test_perf.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree (sklearn default)",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
Out[432]:
Decision Tree (sklearn default) Decision Tree (Pre-Pruning) Decision Tree (Post-Pruning)
Accuracy 0.982000 0.986000 0.991000
Recall 0.947917 0.958333 0.958333
Precision 0.875000 0.901961 0.948454
F1 0.910000 0.929293 0.953368

The test set comparison confirms that pruning significantly improved model generalization: the post-pruned Decision Tree delivers the best overall performance, achieving the highest Accuracy (0.991) and F1-score (0.953; critically, its high Precision (0.948) compared to the default model (0.875) makes it the most effective model for targeting potential loan customers while minimizing wasted marketing efforts.

In [433]:
# importance of features in the tree building
importances = dtree2.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

The feature importance analysis reveals that Income is by far the most significant factor driving personal loan purchase, followed closely by the customer's Education level, and then Family size. The other features, such as CCAvg (Credit Card Average Spending) and CD_Account (Certificate of Deposit Account), have considerably lower but still relevant influence on the prediction.

Predicting on a single data point

The Post-Pruned Decision Tree (dtree3) is the superior choice because it achieved the highest performance metrics on the unseen test data, which is the true measure of a model's predictive power and generalizability.

In [434]:
%%time
# choosing a data point
applicant_details = X_test.iloc[:1, :]

# making a prediction
approval_prediction = dtree3.predict(applicant_details)

print(approval_prediction)
[0]
CPU times: user 4.13 ms, sys: 1e+03 ns, total: 4.13 ms
Wall time: 6.12 ms

Using the post-pruned Decision Tree model (dtree3) to predict the outcome for a single customer resulted in a classification of 0 (No Loan), and the entire inference process was completed quickly in 69.4 milliseconds of wall time, confirming the model's efficiency for immediate, high-volume operational use.

In [435]:
approval_likelihood = dtree3.predict_proba(applicant_details)

print(approval_likelihood[0, 1])
0.0

Key Drivers of Loan Purchase (Targeting Criteria)

The most effective campaign should focus its budget on the following customer attributes, as they are the primary drivers of loan acceptance:

Income (Most Important): This is the paramount predictor. Customers with high annual income (in thousands of dollars) are significantly more likely to take a personal loan. The marketing message should appeal to their financial capacity and ability to handle debt.

Education (Second Most Important): This feature, often combined with Income in the tree's split rules, suggests that the most receptive customer base is highly educated and affluent.

Family Size: The analysis indicated that a median family size of 3 is more receptive. This segment likely has rising financial needs (e.g., mortgages, education costs) that a personal loan could address.

CD Account (Positive Correlation): Counter-intuitively, customers who hold a Certificate of Deposit (CD_Account) are more likely to accept the loan. This suggests that the bank's successful target demographic is financially sophisticated and affluent, using the bank for both savings (CD_Account) and credit (Personal_Loan).

Conclusion

The Decision Tree modeling process, culminating in the Post-Pruned Decision Tree (dtree3), provides AllLife Bank with a highly accurate, fast, and transparent tool for targeted marketing. This model achieved the best generalization performance, with a test set Accuracy of 0.991 and a critical Precision of 0.948, ensuring the bank can maximize the return on its marketing investment.

Recommendations:

The bank should adopt a high-precision, tiered marketing strategy using the Post-Pruned Decision Tree (dtree3), which showed superior test performance (Precision: 0.948).

  1. High-Confidence Targeting (Tier 1)

Focus: Target customers predicted with P(Loan=1) above 0.85.

Method: Use the bank's most personalized, high-cost outreach methods, such as direct calls from relationship managers. The high Precision (94.8%) minimizes wasted budget.

Messaging: Frame the loan as an exclusive offer for valued, affluent clients to help them achieve a specific financial goal (e.g., investment, secondary property, education fund), appealing to their financial sophistication rather than basic need.

  1. Exclusion List (Budget Savings)

Action: Immediately exclude all customers whose profile leads to a final leaf node with P(Loan=1)=0.0.

Rationale: The model is 100% confident these individuals will not buy. Marketing to this group is pure budget waste, and resources should be shifted to Tier 1.

  1. Model Efficiency

The model’s inference time (under 70 milliseconds per customer) is extremely fast, allowing the bank to score the entire liability customer base quickly and enable real-time loan offers during customer service interactions.

The most receptive customers are defined primarily by high income and high education, refined by other financial factors. The bank should prioritize customers who fit the following profiles:

Very High Income: Customers whose income falls above the tree's primary income split threshold are the most likely to convert, regardless of other factors. Marketing should focus on messages about wealth management and investment leverage.High Income and Advanced

Education: This segment includes affluent customers with an Education Level 3 (Advanced/Professional degree). This highly refined group should be targeted with personalized communication from relationship managers, emphasizing their exclusive status.

Moderate-High Income with a CD_Account: This profile indicates a financially sophisticated customer who uses the bank for both major savings and strategic credit, making them a reliable conversion prospect. The loan should be promoted as a tool for financial diversification.

High Income with High Credit Card Spending (CCAvg): These customers demonstrate an active and comfortable relationship with credit, suggesting a higher propensity to accept a new loan offer.

High-Confidence Predictions (P_Loan=1): Any customer whose features lead the Decision Tree model to predict a very high probability of acceptance should be placed in the highest priority tier to justify the cost of personalized, high-value outreach.